Relative Entropy Policy Search

Authors

  • Jan Peters
  • Katharina Mülling
  • Yasemin Altun
Abstract

This technical report describes a cute idea of how to create new policy search approaches. It directly relates to the Natural Actor-Critic methods but allows the derivation of one-shot solutions. Future work may include the application to interesting problems.

1 Problem Statement

In reinforcement learning, we have an agent which is in a state $s$ and draws actions $a$ from a policy $\pi$. Upon taking an action, it receives a reward $r(s,a) = R_{sa}$ and transfers to a next state $s'$, where it takes a next action $a'$. In most cases, we have Markovian environments and policies, where $s' \sim p(s'|s,a) = P_{sa}^{s'}$ and $a \sim \pi(a|s)$. The goal of all reinforcement learning methods is the maximization of the expected return

$$\bar{J}(\pi) = E\left\{ \sum_{t=0}^{T} r(s_t, a_t) \right\}. \quad (1)$$

We are generally interested in two cases: (i) the episodic open-loop case, where the system is always restarted from an initial state distribution $p(s_0)$, and (ii) the stationary infinite-horizon case, where $T \to \infty$. The two cases differ substantially both in their mathematical treatment and in their optimal solutions.

1.1 Episodic Open-Loop Case

In the episodic open-loop case, a distribution $p(\tau)$ over trajectories $\tau$ and a return $R(\tau)$ of a trajectory $\tau$ are assumed; both are given by

$$p(\tau) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t), \quad (2)$$

$$R(\tau) = \sum_{t=0}^{T} r(s_t, a_t). \quad (3)$$

The expected return can now be given as $\bar{J}(\pi) = \sum_{\tau} p(\tau) R(\tau)$. Note that all approximations to the optimal policy depend on the initial state distribution $p(s_0)$. This case has been predominant in our previous work.
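To make Eqs. (1)-(3) concrete, the sketch below samples trajectories $\tau \sim p(\tau)$ from a small, randomly generated tabular MDP and averages the returns $R(\tau)$ to obtain a Monte Carlo estimate of $\bar{J}(\pi)$. This is an illustration only, not the paper's method; the MDP, the policy, and all names (`P`, `R`, `p0`, `pi`, `T`) are hypothetical choices.

```python
# Minimal sketch of Eqs. (1)-(3): Monte Carlo estimation of the expected
# return J(pi) in a hypothetical tabular MDP. Not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 10

# Hypothetical MDP: P[s, a, s'] = p(s'|s, a), R[s, a] = r(s, a), p0 = p(s0).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))
p0 = np.ones(n_states) / n_states

# Hypothetical stochastic Markovian policy: pi[s, a] = pi(a|s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def sample_trajectory():
    """Draw tau = (s_0, a_0, ..., s_T, a_T) ~ p(tau), as in Eq. (2)."""
    s = rng.choice(n_states, p=p0)          # s_0 ~ p(s_0)
    states, actions = [], []
    for t in range(T + 1):
        a = rng.choice(n_actions, p=pi[s])  # a_t ~ pi(a|s_t)
        states.append(s)
        actions.append(a)
        s = rng.choice(n_states, p=P[s, a]) # s_{t+1} ~ p(s'|s_t, a_t)
    return states, actions

def trajectory_return(states, actions):
    """R(tau) = sum_{t=0}^{T} r(s_t, a_t), as in Eq. (3)."""
    return sum(R[s, a] for s, a in zip(states, actions))

# Monte Carlo estimate of J(pi) = sum_tau p(tau) R(tau), Eq. (1).
returns = [trajectory_return(*sample_trajectory()) for _ in range(5000)]
print("estimated J(pi):", np.mean(returns))
```

In the stationary infinite-horizon case (ii), one would instead average rewards along a single long run under the stationary state distribution rather than restarting episodes from $p(s_0)$.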


Similar resources

Online learning in episodic Markovian decision processes by relative entropy policy search

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a ...

Learning to Serve and Bounce a Ball

In this paper we investigate learning the tasks of ball serving and ball bouncing. These tasks display characteristics which are common in a variety of motor skills. To learn the required motor skills for these tasks, the robot uses Relative Entropy Policy Search (REPS), a state-of-the-art method in policy search reinforcement learning. Our experiments show that REPS does not only converge cons...

Twenty Questions for Localizing Multiple Objects by Counting: Bayes Optimal Policies for Entropy Loss

We consider the problem of twenty questions with noiseless answers, in which we aim to locate multiple objects by querying the number of objects in each of a sequence of chosen sets. We assume a joint Bayesian prior density on the locations of the objects and seek to choose the sets queried to minimize the expected entropy of the Bayesian posterior distribution after a fixed number of questions...

Some properties of the parametric relative operator entropy

The notion of entropy was introduced by Clausius in 1850, and some of the main steps towards the consolidation of the concept were taken by Boltzmann and Gibbs. Since then several extensions and reformulations have been developed in various disciplines with motivations and applications in different subjects, such as statistical mechanics, information theory, and dynamical systems. Fujii and Kam...

Policy Search with High-Dimensional Context Variables

Direct contextual policy search methods learn to improve policy parameters and simultaneously generalize these parameters to different context or task variables. However, learning from high-dimensional context variables, such as camera images, is still a prominent problem in many real-world tasks. A naive application of unsupervised dimensionality reduction methods to the context variables, suc...

Observational Modeling of the Kolmogorov-Sinai Entropy

In this paper, Kolmogorov-Sinai entropy is studied using mathematical modeling of an observer $\Theta$. The relative entropy of a sub-$\sigma_\Theta$-algebra having finite atoms is defined, and then the ergodic properties of relative semi-dynamical systems are investigated. Also, a relative version of the Kolmogorov-Sinai theorem is given. Finally, it is proved that the relative entropy of a...



Publication date: 2010